Sentence Alignment in Parallel Corpora

Authors

  • N Collier
  • K Takahashi
Abstract

This report has two aims: to give information about the issues behind corpus alignment and the techniques currently used, and to describe a particular corpus which members of CCL were involved in constructing, the Asahi corpus. The subject of aligning parallel corpora is expanding rapidly, particularly because bottom-up machine translation (MT) paradigms such as Example-based MT and Statistics-based MT are looking for large knowledge sources. However, most work has been done on aligning European language corpora such as the Canadian Hansard, and this necessarily ignores many of the difficult issues we face when aligning more semantically distant languages such as English and Japanese. The Asahi corpus was constructed from a CD-ROM of newspaper editorials which were automatically aligned using a hybrid statistical-linguistic approach at the Nara Advanced Institute of Science and Technology (NAIST) in Japan. The Asahi editorials appear daily in a national broadsheet newspaper and tend to comment on subjects in the news headlines. Although previous small-scale experiments had shown NAIST's technique to be very reliable (less than 4 percent error), CCL researchers required an even smaller error rate and a fine tolerance for sentence alignment. Consequently, we decided to have a bilingual human checker verify the corpus, some 330,000 words of English and a similar amount of Japanese. The report gives the statistical characteristics of the final corpus and also a detailed subject breakdown.

Similar resources

Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...

Sentence Alignment in Parallel, Comparable, and Quasi-comparable Corpora

We explore the usability of different bilingual corpora for the purpose of multilingual and cross-lingual natural language processing. The usability of a bilingual corpus is evaluated by the lexical alignment score calculated for the bi-lexicon pair distributed in the aligned bilingual sentence pairs. We compare and contrast a number of bilingual corpora, ranging from parallel, to comparable, and...

Sentence Alignment for Monolingual Comparable Corpora

We address the problem of sentence alignment for monolingual corpora, a phenomenon distinct from alignment in parallel corpora. Aligning large comparable corpora automatically would provide a valuable resource for learning of text-to-text rewriting rules. We incorporate context into the search for an optimal alignment in two complementary ways: learning rules for matching paragraphs using topic ...

Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora

We address the problem of unsupervised and language-pair independent alignment of symmetrical and asymmetrical parallel corpora. Asymmetrical parallel corpora contain a large proportion of 1-to-0/0-to-1 and 1-to-many/many-to-1 sentence correspondences. We have developed a novel approach which is fast and allows us to achieve high accuracy in terms of F1 for the alignment of both asymmetrical an...

Vicinity-Driven Paragraph and Sentence Alignment for Comparable Corpora

Parallel corpora have driven great progress in the field of Text Simplification. However, most sentence alignment algorithms either offer a limited range of alignment types supported, or simply ignore valuable clues present in comparable documents. We address this problem by introducing a new set of flexible vicinity-driven paragraph and sentence alignment algorithms that 1-N, N-1, N-N and long...

Hybrid Parallel Sentence Mining from Comparable Corpora

Mining for parallel sentences in comparable corpora is much more difficult than aligning sentences in parallel corpora. Sentence alignment in parallel corpora usually exploits simple empirical evidence (turned into assumptions) such as (i) the length of a sentence is proportional to the length of its translation and (ii) the discourse flow is necessarily the same in both parts of the bi-text ...

Publication date: 1995